{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "0687a777-4363-4d03-8d67-92a2342dc569",
   "metadata": {
    "papermill": {
     "duration": 0.057726,
     "end_time": "2022-10-24T06:29:02.577249",
     "exception": false,
     "start_time": "2022-10-24T06:29:02.519523",
     "status": "completed"
    },
    "pycharm": {
     "name": "#%% md\n"
    },
    "tags": []
   },
   "source": [
    "# 2. Exploratory data analysis - In-depth profiling"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ef0e9941",
   "metadata": {
    "papermill": {
     "duration": 0.051636,
     "end_time": "2022-10-24T06:29:02.687919",
     "exception": false,
     "start_time": "2022-10-24T06:29:02.636283",
     "status": "completed"
    },
    "pycharm": {
     "name": "#%% md\n"
    },
    "tags": []
   },
   "source": [
    "The first step in a data preparation pipeline is the exploratory data analysis (EDA). In a nutshell, data exploration and data cleansing are hand-to-hand and both are mutually iterative steps.\n",
    "\n",
    "*But what does data exploration includes? And how to make a better data exploration giving we are building a credit scorecard model?*\n",
    "\n",
    "\n",
    "Data exploration includes both univariate and bivariate analysis and ranges from univariate statistics and frequency distributions to correlations, cross-tabulation, and characteristic analysis.\n",
    "add here detail about pandas-profiling and data exploration in general (re-use the sentence above)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8619741d-a1d4-43ca-978d-2bbe16326d53",
   "metadata": {
    "papermill": {
     "duration": 0.047898,
     "end_time": "2022-10-24T06:29:02.785015",
     "exception": false,
     "start_time": "2022-10-24T06:29:02.737117",
     "status": "completed"
    },
    "pycharm": {
     "name": "#%% md\n"
    },
    "tags": []
   },
   "source": [
    "## Read the data & computed metadata"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3359333b-911c-4b81-b47e-d3ae73db3ee0",
   "metadata": {
    "papermill": {
     "duration": 0.052874,
     "end_time": "2022-10-24T06:29:06.344787",
     "exception": false,
     "start_time": "2022-10-24T06:29:06.291913",
     "status": "completed"
    },
    "pycharm": {
     "name": "#%% md\n"
    },
    "tags": []
   },
   "source": [
    "### Import needed packages"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "73932d1c-5b2f-4145-81b8-bc6a4724cbea",
   "metadata": {
    "papermill": {
     "duration": 2.185995,
     "end_time": "2022-10-24T06:29:08.587322",
     "exception": false,
     "start_time": "2022-10-24T06:29:06.401327",
     "status": "completed"
    },
    "pycharm": {
     "name": "#%%\n"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "import os\n",
    "from pickle import load\n",
    "\n",
    "import pandas as pd\n",
    "from ydata.labs.datasources import DataSources\n",
    "from ydata.metadata import Metadata\n",
    "from ydata.profiling import ProfileReport\n",
    "from ydata.utils.formats import read_json"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d67b3ce6-cb30-4115-8f0a-bd5dba7b91fe",
   "metadata": {
    "papermill": {
     "duration": 6.06512,
     "end_time": "2022-10-24T06:29:14.706508",
     "exception": false,
     "start_time": "2022-10-24T06:29:08.641388",
     "status": "completed"
    },
    "pycharm": {
     "name": "#%%\n"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "dataset = DataSources.get(uid=\"973d95c7-e6bd-4535-a0ea-d3dd1e893b13\").read()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ae99d996-326f-43d9-8d97-566bcdd4813f",
   "metadata": {
    "papermill": {
     "duration": 0.08175,
     "end_time": "2022-10-24T06:29:14.949532",
     "exception": false,
     "start_time": "2022-10-24T06:29:14.867782",
     "status": "completed"
    },
    "pycharm": {
     "name": "#%%\n"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "meta = Metadata.load(\"metadata.pkl\")\n",
    "print(meta)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c3080656-b0a5-4b99-a61e-ffabdf58f165",
   "metadata": {
    "papermill": {
     "duration": 0.055638,
     "end_time": "2022-10-24T06:29:15.056669",
     "exception": false,
     "start_time": "2022-10-24T06:29:15.001031",
     "status": "completed"
    },
    "pycharm": {
     "name": "#%% md\n"
    },
    "tags": []
   },
   "source": [
    "## Generating the full data profile"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6bdca1ad-9609-41ef-9d04-46cee35ed208",
   "metadata": {
    "papermill": {
     "duration": 0.15526,
     "end_time": "2022-10-24T06:29:15.262701",
     "exception": false,
     "start_time": "2022-10-24T06:29:15.107441",
     "status": "completed"
    },
    "pycharm": {
     "name": "#%%\n"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "data_path = os.environ.get(\"DATASET_PATH\", \"\")\n",
    "data_name = os.environ.get(\"DATASET_NAME\", \"\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2b7b0d4b-762d-4f53-aa8e-b14b8f26e532",
   "metadata": {
    "papermill": {
     "duration": 35.399872,
     "end_time": "2022-10-24T06:29:50.722568",
     "exception": false,
     "start_time": "2022-10-24T06:29:15.322696",
     "status": "completed"
    },
    "pycharm": {
     "name": "#%%\n"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "print(f\"Profile Name: {data_name}_profile\")\n",
    "profile = ProfileReport(df=data, title=\"Data profiling\")\n",
    "profile.config.html.navbar_show = False\n",
    "\n",
    "profile.to_file(f\"{data_name}_profile.html\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7884d159-747d-4abc-b957-5c2a9e137b27",
   "metadata": {
    "jupyter": {
     "source_hidden": true
    },
    "papermill": {
     "duration": 0.194447,
     "end_time": "2022-10-24T06:29:51.330095",
     "exception": false,
     "start_time": "2022-10-24T06:29:51.135648",
     "status": "completed"
    },
    "pycharm": {
     "name": "#%%\n"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "import json\n",
    "\n",
    "metadata = {\n",
    "    \"outputs\": [\n",
    "        {\n",
    "            \"type\": \"table\",\n",
    "            \"storage\": \"inline\",\n",
    "            \"format\": \"csv\",\n",
    "            \"header\": list(ratio_labels.columns),\n",
    "            \"source\": ratio_labels.to_csv(header=False, index=True),\n",
    "        },\n",
    "        {\n",
    "            \"type\": \"web-app\",\n",
    "            \"storage\": \"inline\",\n",
    "            \"source\": profile.to_html(),\n",
    "        },\n",
    "    ]\n",
    "}\n",
    "\n",
    "with open(\"mlpipeline-ui-metadata.json\", \"w\") as metadata_file:\n",
    "    json.dump(metadata, metadata_file)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.7"
  },
  "papermill": {
   "default_parameters": {},
   "duration": 52.848466,
   "end_time": "2022-10-24T06:29:54.019521",
   "environment_variables": {},
   "exception": null,
   "input_path": "/mnt/laboratory/1ffabd97-374f-483a-885a-d025b3a2e2ef/02_Data_profiling.ipynb",
   "output_path": "/mnt/laboratory/1ffabd97-374f-483a-885a-d025b3a2e2ef/02_Data_profiling.ipynb",
   "parameters": {},
   "start_time": "2022-10-24T06:29:01.171055",
   "version": "2.4.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
